Today, let's start with a quick introduction to what ArrayType() is.
ArrayType()
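Its main job is to let you put arrays into a DataFrame: it declares a column whose values are arrays. And then, Do Re Mi So,
your DataFrame can hold more information.
So how do you get that data back out?
You use explode()
to pull the elements out, one per row, so they can be stored the row-based way a relational database stores them.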
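Before the Spark version, here is a rough idea of what explode() does, sketched in plain Python with no Spark involved. `explode_rows` is a hypothetical helper written just for this illustration, and the "No,,Languages" row is made up to show the empty-array case. Treat it as a mental model, not as how Spark implements it.

```python
# Plain-Python sketch of explode(): one output row per array element.
# explode_rows is a hypothetical helper for illustration, not a Spark API.
def explode_rows(rows):
    """Yield (name, element) pairs, one per element of each row's array."""
    for name, languages in rows:
        # Like Spark's explode(), a None or empty array yields no rows at all
        for language in languages or []:
            yield (name, language)


rows = [
    ("James,,Smith", ["Java", "Scala", "C++"]),
    ("Robert,,Williams", ["CSharp", "VB"]),
    ("No,,Languages", []),  # made-up row to show the empty-array case
]

exploded = list(explode_rows(rows))
for row in exploded:
    print(row)
```

Note that the row with the empty array disappears from the result. Spark's explode() behaves the same way for null or empty arrays; if you need to keep those rows, explode_outer() is the function to look at.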
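from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.types import StringType, ArrayType, StructType, StructField

# Reuse the active session (e.g. in a notebook) or start a new one
spark = SparkSession.builder.getOrCreate()

# Each record: name, languages at school, languages at work, current state, previous state
data = [
    ("James,,Smith", ["Java", "Scala", "C++"], ["Spark", "Java"], "OH", "CA"),
    ("Michael,Rose,", ["Spark", "Java", "C++"], ["Spark", "Java"], "NY", "NJ"),
    ("Robert,,Williams", ["CSharp", "VB"], ["Spark", "Python"], "UT", "NV"),
]

# The two language columns are declared as arrays of strings
schema = StructType([
    StructField("name", StringType(), True),
    StructField("languagesAtSchool", ArrayType(StringType()), True),
    StructField("languagesAtWork", ArrayType(StringType()), True),
    StructField("currentState", StringType(), True),
    StructField("previousState", StringType(), True),
])

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show()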
'''
+----------------+------------------+---------------+------------+-------------+
|            name| languagesAtSchool|languagesAtWork|currentState|previousState|
+----------------+------------------+---------------+------------+-------------+
|    James,,Smith|[Java, Scala, C++]|  [Spark, Java]|          OH|           CA|
|   Michael,Rose,|[Spark, Java, C++]|  [Spark, Java]|          NY|           NJ|
|Robert,,Williams|      [CSharp, VB]|[Spark, Python]|          UT|           NV|
+----------------+------------------+---------------+------------+-------------+
'''
# explode() emits one row per array element; the new column is named "col" by default
df.select(df.name, explode(df.languagesAtSchool)).show()
'''
+----------------+------+
|            name|   col|
+----------------+------+
|    James,,Smith|  Java|
|    James,,Smith| Scala|
|    James,,Smith|   C++|
|   Michael,Rose,| Spark|
|   Michael,Rose,|  Java|
|   Michael,Rose,|   C++|
|Robert,,Williams|CSharp|
|Robert,,Williams|    VB|
+----------------+------+
'''
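To make the "row-based relational database" point a bit more concrete, here is a small sketch that loads the exploded rows into SQLite using Python's built-in sqlite3 module. The table name `school_languages` and the in-memory database are my own assumptions for the demo; in a real pipeline you would more likely write the DataFrame out from Spark directly.

```python
import sqlite3

# (name, language) pairs copied from the exploded output above
exploded_rows = [
    ("James,,Smith", "Java"),
    ("James,,Smith", "Scala"),
    ("James,,Smith", "C++"),
    ("Michael,Rose,", "Spark"),
    ("Michael,Rose,", "Java"),
    ("Michael,Rose,", "C++"),
    ("Robert,,Williams", "CSharp"),
    ("Robert,,Williams", "VB"),
]

# Throwaway in-memory database; school_languages is a made-up table name
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE school_languages (name TEXT, language TEXT)")
conn.executemany("INSERT INTO school_languages VALUES (?, ?)", exploded_rows)

# With one language per row, a plain GROUP BY can count learners per language
counts = conn.execute(
    "SELECT language, COUNT(*) FROM school_languages "
    "GROUP BY language ORDER BY language"
).fetchall()
print(counts)
conn.close()
```

Because every language sits in its own row, ordinary SQL such as GROUP BY works without any array handling, which is the whole reason for exploding first.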
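If anything is unclear or wrong, or if you have another approach to share, feel free to leave me a comment! And if you enjoyed this, likes and subscriptions are welcome too!
I'm Vivi, a data engineer struggling in the cloud! See you in the next post! Bye Bye~
[This article will also be published on my personal Medium. Looking forward to meeting you there!]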